Abstract: We consider the sparse regression model where the number of
parameters p is larger than the sample size n. The difficulty when
considering high-dimensional problems is to propose estimators achieving a
good compromise between statistical and computational performances. The
Lasso is the solution of a convex minimization problem, hence computable for
large values of p. However, stringent conditions on the design are required
to establish fast rates of convergence for this estimator. Dalalyan and
Tsybakov [17-19] proposed an exponential weights procedure achieving a good
compromise between the statistical and computational aspects. This
estimator can be computed for reasonably large p and satisfies a sparsity oracle
inequality in expectation for the empirical excess risk only under mild
assumptions on the design. In this paper, we propose an exponential weights
estimator similar to that of [17] but with improved statistical performances.
Our main result is a sparsity oracle inequality in probability for the true
excess risk.
1. Introduction

We observe n independent pairs $(X_1, Y_1), \dots, (X_n, Y_n) \in \mathcal{X} \times \mathbb{R}$ (where $\mathcal{X}$ is any
measurable set) such that

$$Y_i = f(X_i) + W_i, \quad 1 \le i \le n, \qquad (1.1)$$

where $f : \mathcal{X} \to \mathbb{R}$ is the unknown regression function and the noise variables
$W_1, \dots, W_n$ are independent of the design $(X_1, \dots, X_n)$ and satisfy $\mathbb{E}W_i = 0$
and $\mathbb{E}W_i^2 \le \sigma^2$ for any $1 \le i \le n$ and for some known $\sigma^2 > 0$. The distribution
of the sample is denoted by $\mathbb{P}$, the corresponding expectation is denoted by $\mathbb{E}$.

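For any function $g : \mathcal{X} \to \mathbb{R}$ define $\|g\|_n = \left(\sum_{i=1}^n g(X_i)^2 / n\right)^{1/2}$ and
$\|g\| = \left(\mathbb{E}\|g\|_n^2\right)^{1/2}$. Let $\mathcal{F} = \{\phi_1, \dots, \phi_p\}$ be a set, called a dictionary, of functions
$\phi_j : \mathcal{X} \to \mathbb{R}$ such that $\|\phi_j\| = 1$ for any j (this assumption can be relaxed). For
any $\theta \in \mathbb{R}^p$ define $f_\theta = \sum_{j=1}^p \theta_j \phi_j$, the empirical risk

$$r(\theta) = \frac{1}{n} \sum_{i=1}^n \left(Y_i - f_\theta(X_i)\right)^2$$

and the integrated risk

$$R(\theta) = \mathbb{E}\left[\frac{1}{n} \sum_{i=1}^n \left(Y_i' - f_\theta(X_i')\right)^2\right],$$

where $\{(X_1', Y_1'), \dots, (X_n', Y_n')\}$ is an independent replication of $\{(X_1, Y_1), \dots,
(X_n, Y_n)\}$. Let us choose $\bar\theta \in \arg\min_{\theta \in \mathbb{R}^p} R(\theta)$. Note that the minimum may
not be unique; however, we do not need to treat the identifiability question since
we consider in this paper the prediction problem, i.e., find an estimator $\hat\theta_n$ such
that $R(\hat\theta_n)$ is close to $\min_{\theta \in \mathbb{R}^p} R(\theta)$ up to a positive remainder term as small as
possible.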
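
The definitions of $f_\theta$ and of the empirical risk $r(\theta)$ can be illustrated with a minimal Python sketch. The toy dictionary, data and function names below are hypothetical and chosen only for illustration; in particular the toy dictionary is not normalized so that $\|\phi_j\| = 1$.

```python
from typing import Callable, Sequence

Dictionary = Sequence[Callable[[float], float]]


def f_theta(theta: Sequence[float], dictionary: Dictionary, x: float) -> float:
    """Evaluate f_theta(x) = sum_j theta_j * phi_j(x)."""
    return sum(t * phi(x) for t, phi in zip(theta, dictionary))


def empirical_risk(theta: Sequence[float], dictionary: Dictionary,
                   xs: Sequence[float], ys: Sequence[float]) -> float:
    """Empirical risk r(theta) = (1/n) * sum_i (Y_i - f_theta(X_i))^2."""
    n = len(xs)
    return sum((y - f_theta(theta, dictionary, x)) ** 2
               for x, y in zip(xs, ys)) / n


# Hypothetical toy example: dictionary {1, x} and noiseless data y = 1 + 2x.
toy_dictionary = [lambda x: 1.0, lambda x: x]
toy_xs = [0.0, 1.0, 2.0]
toy_ys = [1.0, 3.0, 5.0]
```

With these toy data the vector (1, 2) reproduces the observations exactly, so its empirical risk vanishes, whereas the null vector has empirical risk equal to the mean of the squared responses.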

It is a known fact that the least-squares estimator $\hat\theta_n^{LSE} \in \arg\min_{\theta \in \mathbb{R}^p} r(\theta)$
performs poorly in high dimension $p > n$. Indeed, consider for instance the
deterministic design case with i.i.d. noise variables $N(0, \sigma^2)$ and a full-rank
design matrix; then $\hat\theta_n^{LSE}$ satisfies

$$\mathbb{E}\left[\|f_{\hat\theta_n^{LSE}} - f\|_n^2\right] - \|f_{\bar\theta} - f\|_n^2 = \sigma^2.$$
In the same context, assume now there exists a vector $\bar\theta \in \arg\min_{\theta \in \mathbb{R}^p} R(\theta)$
with a number of nonzero coordinates $p_0 < n$. If the indices of these coordinates
are known, the least-squares estimator restricted to these coordinates, denoted
$\hat\theta_n^0$, has an excess risk of order $\sigma^2 p_0 / n$.

The estimator $\hat\theta_n^0$ is called the oracle estimator since the set of indices of the nonzero
coordinates of $\bar\theta$ is unknown in practice. The issue is now to build an estimator,
when the set of nonzero coordinates of $\bar\theta$ is unknown, with statistical
performances close to that of the oracle estimator.

A possible approach is to consider solutions of penalized empirical risk
minimization problems:

$$\hat\theta_n^{pen} \in \arg\min_{\theta \in \mathbb{R}^p} \left\{ r(\theta) + \mathrm{pen}(\theta) \right\},$$

where the penalization $\mathrm{pen}(\theta)$ is proportional to the number of nonzero
components of $\theta$, as for instance in the AIC, $C_p$ and BIC criteria [1, 31, 37]. Bunea,
Tsybakov and Wegkamp [9] established for the BIC estimator $\hat\theta_n^{BIC}$ the
following non-asymptotic sparsity oracle inequality. For any $\varepsilon > 0$ there exists a
constant $C(\varepsilon) > 0$ such that for any $p \ge 2$, $n \ge 1$ we have



Despite good statistical properties, these estimators can only be computed in 
practice for p of the order at most a few tens since they are solutions of non- 
convex combinatorial optimization problems. 
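
To give an idea of why exhaustive search over supports is limited to small p, the following Python sketch counts the candidate supports of size at most a given bound. The function name is hypothetical and the count is only a rough proxy for the cost of an $\ell_0$-penalized search, not a statement about any particular solver.

```python
from math import comb


def n_submodels(p: int, max_size: int) -> int:
    """Number of subsets J of {1, ..., p} with |J| <= max_size."""
    return sum(comb(p, k) for k in range(min(p, max_size) + 1))


# Without a size restriction the count is 2**p, which already exceeds
# one billion for p = 30.
all_supports_p30 = n_submodels(30, 30)
```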

Considering a convex penalty function leads to computationally feasible
optimization problems. A popular example of convex optimization problem is the
Lasso estimator (cf. Frank and Friedman [22], Tibshirani [39], and the parallel
work of Chen, Donoho and Saunders [15] on basis pursuit) with the penalty
term $\mathrm{pen}(\theta) = \lambda|\theta|_1$, where $\lambda > 0$ is some regularization parameter and, for any
integer $d \ge 2$, real $q > 0$ and vector $z \in \mathbb{R}^d$ we define $|z|_q = \left(\sum_{j=1}^d |z_j|^q\right)^{1/q}$
and $|z|_\infty = \max_{1 \le j \le d} |z_j|$. Several algorithms make it possible to compute the Lasso for
very large p; one of the most popular is known as LARS, introduced by Efron,
Hastie, Johnstone and Tibshirani [21]. However, the Lasso estimator requires
strong assumptions on the matrix $A = (\phi_j(X_i))_{1 \le i \le n, 1 \le j \le p}$ to establish fast
rates of convergence results. Bunea, Tsybakov and Wegkamp [8] and Lounici
[30] assume a mutual coherence condition on the dictionary. Bickel, Ritov and
Tsybakov [7] and Koltchinskii [25] established sparsity oracle inequalities for the
Lasso under a restricted eigenvalue condition. Candès and Tao [11] and
Koltchinskii [26] studied the Dantzig Selector, which is related to the Lasso estimator and
suffers from the same restrictions. See, e.g., Bickel, Ritov and Tsybakov [7] for
more details. Several alternative penalties were recently considered. Zou [45]
proposed the adaptive Lasso, which is the solution of a penalized empirical risk
minimization problem with the penalty $\mathrm{pen}(\theta) = \lambda \sum_{j=1}^p w_j^{-1} |\theta_j|$ where $w$ is
an initial estimator of $\theta$. Zou and Hastie [46] proposed the elastic net with the
penalty $\mathrm{pen}(\theta) = \lambda_1|\theta|_1 + \lambda_2|\theta|_2^2$, $\lambda_1, \lambda_2 > 0$. Meinshausen and Bühlmann [35]
and Bach [6] considered the bootstrapped Lasso. See also Ghosh [23] or Cai, Xu
and Zhang [10] for more alternatives to the Lasso. All these methods were
motivated by their superior performances over the Lasso either from the theoretical
or the practical point of view. However, strong assumptions on the design are
still required to establish the statistical properties of these methods (when such
results exist). A recent paper by van de Geer and Bühlmann [42] provides a
complete survey and comparison of all these assumptions.
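
The norms and penalties mentioned in this paragraph can be written down in a few lines of Python. This is a minimal sketch with hypothetical function names; the adaptive Lasso weights are taken here as $1/|w_j|$ for a nonzero initial estimate $w$, which is an assumption made for the illustration.

```python
from typing import Sequence


def lq_norm(z: Sequence[float], q: float) -> float:
    """|z|_q = (sum_j |z_j|^q)^(1/q) for a real q > 0."""
    return sum(abs(v) ** q for v in z) ** (1.0 / q)


def sup_norm(z: Sequence[float]) -> float:
    """|z|_infinity = max_j |z_j|."""
    return max(abs(v) for v in z)


def lasso_penalty(theta: Sequence[float], lam: float) -> float:
    """Lasso penalty: lam * |theta|_1."""
    return lam * sum(abs(t) for t in theta)


def elastic_net_penalty(theta: Sequence[float], lam1: float, lam2: float) -> float:
    """Elastic net penalty: lam1 * |theta|_1 + lam2 * |theta|_2^2."""
    return lam1 * sum(abs(t) for t in theta) + lam2 * sum(t * t for t in theta)


def adaptive_lasso_penalty(theta: Sequence[float], w: Sequence[float],
                           lam: float) -> float:
    """Adaptive Lasso penalty lam * sum_j |theta_j| / |w_j| (assumes w_j != 0)."""
    return lam * sum(abs(t) / abs(wj) for t, wj in zip(theta, w))
```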

Simultaneously, the PAC-Bayesian approach for regression estimation was
developed by Audibert [4, 5] and Alquier [2, 3], based on previous works in
the classification context by Catoni [12-14], McAllester [34], Shawe-Taylor and
Williamson [38]; see also Zhang [44] in the context of density estimation. This
framework is very well adapted for studying the excess risk $R(\cdot) - R(\bar\theta)$ in the
regression context since it requires very weak conditions on the dictionary.
However, the methods of these papers are not designed to cover the high-dimensional
setting under the sparsity assumption. Dalalyan and Tsybakov [16-19] propose
an exponential weights procedure related to the PAC-Bayesian approach with
good statistical and computational performances. However, they consider a
deterministic design, establishing their statistical result only for the empirical excess
risk instead of the true excess risk $R(\cdot) - R(\bar\theta)$.

In this paper, we propose to study two exponential weights estimation
procedures. The first one is an exponential weights combination of the least squares
estimators in all the possible sub-models. This estimator was initially proposed
by Leung and Barron [28] in the deterministic design setting. Note that in the
literature on aggregation and exponential weights, the elements of the dictionary
are often arbitrary preliminary estimators computed from a frozen fraction of
the initial sample so that these estimators are considered as deterministic
functions; the aggregate is then computed using this dictionary and the remaining
data. This scheme is referred to as 'data splitting'. See for instance Dalalyan and
Tsybakov [18] and Yang [43]. Leung and Barron [28] proved that data splitting
is not necessary in order to aggregate least squares estimators and raised the
question of computation of this estimator in high dimension. In this paper we
make explicit the oracle inequality satisfied by this estimator in the high-dimensional
case. For the second procedure, the design may be either random or
deterministic. We adapt to the regression framework PAC-Bayesian techniques developed
by Catoni [14] in the classification framework to build an estimator satisfying a
sparsity oracle inequality for the true excess risk. Even though we do not study
the computational aspect in this paper, it should be noted that efficient Monte
Carlo algorithms are available to compute these exponential weights estimators
for reasonably large dimension p ($p \approx 5000$); see in particular the monograph
of Marin and Robert [32] for an introduction to MCMC methods. Note also
that in a work parallel to ours, Rigollet and Tsybakov [36] also consider an
exponential weights procedure with discrete priors and suggest a version of the
Metropolis-Hastings algorithm to compute it.
The paper is organized as follows. In Section 2 we define an exponential
weights procedure and derive a sparsity oracle inequality in expectation when 
the design is deterministic. In Section 3, the design can be either deterministic 
or random. We propose a modification of the first exponential weights procedure 
for which we can establish a sparsity oracle inequality in probability for the true 
excess risk. Finally Section 4 contains all the proofs of our results. 



2. Sparsity oracle inequality in expectation 

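Throughout this section, we assume that the design is deterministic and the
noise variables $W_1, \dots, W_n$ are i.i.d. Gaussian $N(0, \sigma^2)$.
For any $J \subset \{1, \dots, p\}$ and $K > 0$ define

$$\Theta(J) = \left\{ \theta \in \mathbb{R}^p : \forall j \notin J,\ \theta_j = 0 \right\}, \qquad (2.1)$$

and

$$\Theta_K(J) = \left\{ \theta \in \mathbb{R}^p : |\theta|_1 \le K \text{ and } \forall j \notin J,\ \theta_j = 0 \right\}. \qquad (2.2)$$

For the sake of simplicity we will write $\Theta_K = \Theta_K(\{1, \dots, p\})$.

For any subset $J \subset \{1, \dots, p\}$ define

$$\hat\theta_J \in \arg\min_{\theta \in \Theta(J)} r(\theta),$$

where $r(\theta) = \frac{1}{n} \sum_{i=1}^n (Y_i - f_\theta(X_i))^2 = \|Y - f_\theta\|_n^2$ with $Y = (Y_1, \dots, Y_n)^T$.
Denote by $\mathcal{P}_n(\{1, \dots, p\})$ the set of all subsets of $\{1, \dots, p\}$ containing at most
n elements. The aggregate $\hat f_n$ is defined as follows

$$\hat f_n = f_{\hat\theta_n}, \qquad \hat\theta_n = \hat\theta_n(\lambda, \pi) = \frac{\sum_{J \in \mathcal{P}_n(\{1, \dots, p\})} \pi_J\, e^{-\lambda\left(r(\hat\theta_J) + \frac{2\sigma^2 |J|}{n}\right)}\, \hat\theta_J}{\sum_{J \in \mathcal{P}_n(\{1, \dots, p\})} \pi_J\, e^{-\lambda\left(r(\hat\theta_J) + \frac{2\sigma^2 |J|}{n}\right)}}, \qquad (2.3)$$

where $\lambda > 0$ is the temperature parameter, $\pi$ is the prior probability
distribution on $\mathcal{P}(\{1, \dots, p\})$, the set of all subsets of $\{1, \dots, p\}$, that is, for any
$J \in \mathcal{P}(\{1, \dots, p\})$, $\pi_J \ge 0$ and $\sum_{J \in \mathcal{P}(\{1, \dots, p\})} \pi_J = 1$.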
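
The weighting scheme behind the aggregate can be sketched in Python as follows. This is a minimal illustration, assuming weights proportional to $\pi_J \exp\{-\lambda(r(\hat\theta_J) + 2\sigma^2|J|/n)\}$, which is the form of the quantity $g_J$ used in the proof in Section 4.1; the function names are hypothetical and the least squares fits $\hat\theta_J$ are taken as given inputs.

```python
import math
from typing import List, Sequence


def exp_weights(emp_risks: Sequence[float], sizes: Sequence[int],
                prior: Sequence[float], lam: float, sigma2: float,
                n: int) -> List[float]:
    """Normalized weights w_J proportional to
    prior_J * exp(-lam * (r_J + 2 * sigma2 * |J| / n)).

    Computed on the log scale for numerical stability. Assumes that at
    least one prior mass is strictly positive.
    """
    log_scores = []
    for r, k, pi in zip(emp_risks, sizes, prior):
        if pi > 0.0:
            log_scores.append(math.log(pi) - lam * (r + 2.0 * sigma2 * k / n))
        else:
            log_scores.append(-math.inf)  # zero prior mass gives zero weight
    top = max(log_scores)
    unnormalized = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def aggregate(thetas: Sequence[Sequence[float]],
              weights: Sequence[float]) -> List[float]:
    """Convex combination sum_J w_J * theta_J of the sub-model estimators."""
    p = len(thetas[0])
    return [sum(w * th[j] for w, th in zip(weights, thetas)) for j in range(p)]
```

Larger values of the temperature parameter concentrate the weights on the sub-models with the smallest penalized empirical risk, while small values keep the weights close to the prior.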

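The next result is a reformulation in our context of Theorem 8 of [28].

Proposition 2.1. Assume that the noise variables $W_1, \dots, W_n$ are i.i.d. $N(0, \sigma^2)$.
Then the aggregate $\hat\theta_n$ defined by (2.3) with $0 < \lambda \le \frac{n}{4\sigma^2}$ satisfies

$$\mathbb{E}\left[\|\hat f_n - f\|_n^2\right] \le \min_{J \in \mathcal{P}_n(\{1, \dots, p\})} \left\{ \mathbb{E}\left[\|f_{\hat\theta_J} - f\|_n^2\right] + \frac{1}{\lambda} \log\left(\frac{1}{\pi_J}\right) \right\}.$$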

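Proposition 2.1 holds true for any prior $\pi$. We suggest using the following
prior. Fix $\alpha \in (0, 1)$ and define

$$\pi_J = \frac{\alpha^{|J|}}{\sum_{j=0}^n \alpha^j} \binom{p}{|J|}^{-1}, \quad \forall J \in \mathcal{P}_n(\{1, \dots, p\}), \quad \text{and } \pi_J = 0 \text{ if } |J| > n. \qquad (2.5)$$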
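
The following Python sketch checks two properties of a sparsity-favoring prior of this type. It assumes that (2.5) has the form $\pi_J = \alpha^{|J|} / \big(\sum_{j=0}^n \alpha^j \binom{p}{|J|}\big)$ for $|J| \le n$ (the displayed formula is only partly legible, so this form is an assumption), and that $n \le p$. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb, e, log


def prior_mass(size: int, p: int, n: int, alpha: Fraction) -> Fraction:
    """Prior mass of one given subset J with |J| = size (zero if size > n)."""
    if size > n:
        return Fraction(0)
    normalizer = sum(alpha ** j for j in range(n + 1))
    return alpha ** size / (normalizer * comb(p, size))


def total_mass(p: int, n: int, alpha: Fraction) -> Fraction:
    """Sum of the prior masses over all subsets of size at most n (n <= p)."""
    return sum(comb(p, k) * prior_mass(k, p, n, alpha) for k in range(n + 1))


def log_inverse_prior_bound(size: int, p: int, alpha: float) -> float:
    """Upper bound |J| log(e p / (alpha |J|)) + log(1 / (1 - alpha)) on
    log(1 / pi_J), valid for |J| >= 1."""
    return size * log(e * p / (alpha * size)) + log(1.0 / (1.0 - alpha))
```

Exact rational arithmetic is used so that the normalization can be checked without rounding error. The bound on $\log(1/\pi_J)$ follows from $\binom{p}{k} \le (ep/k)^k$ and $\sum_{j \ge 0} \alpha^j = 1/(1-\alpha)$.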






P. Alquier and K. Lounici 



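As a consequence, we obtain the following immediate corollary of
Proposition 2.1.

Theorem 2.1. Assume that the noise variables $W_1, \dots, W_n$ are i.i.d. $N(0, \sigma^2)$.
Then the aggregate $\hat f_n = f_{\hat\theta_n}$, with $\lambda = \frac{n}{4\sigma^2}$ and $\pi$ taken as in (2.5), satisfies

$$\mathbb{E}\left[\|\hat f_n - f\|_n^2\right] \le \min_{\theta \in \mathbb{R}^p} \left\{ \|f_\theta - f\|_n^2 + \frac{\sigma^2 |J(\theta)|}{n} \left( 4 \log\left(\frac{pe}{\alpha |J(\theta)|}\right) + 1 \right) + \frac{4\sigma^2}{n} \log\left(\frac{1}{1 - \alpha}\right) \right\}, \qquad (2.6)$$

where for any $\theta \in \mathbb{R}^p$, $J(\theta) = \{j : \theta_j \neq 0\}$.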

This result improves upon [17] which established in Theorem 6 for gaussian 
noise and deterministic design 



E 



\fn 




where $\log_+ x = \max\{\log x, 0\}$ and we recall that $A = (\phi_j(X_i))_{1 \le i \le n, 1 \le j \le p}$.
Note that our bound is faster by a logarithmic factor in $|\theta|_1$. Note also that the bound in
the above display grows worse for large values of $|\theta|_1$.

In order to evaluate the performance of these exponential weights procedures,
[40] developed a notion of optimal rate of sparse prediction. In particular, [36]
established that there exists a numerical constant $c^* > 0$ such that for any
estimator $T_n$

$$\sup_{\theta \in \mathbb{R}^p \setminus \{0\} :\, |J(\theta)| \le s}\ \sup_f \left\{ \mathbb{E}\left[\|T_n - f\|_n^2\right] - \|f_\theta - f\|_n^2 \right\} \ge c^* \frac{\sigma^2}{n} \left[ \mathrm{rank}(A) \wedge s \log\left(1 + \frac{ep}{s}\right) \right].$$

The above display combined with Theorem 2.1 shows that the exponential
weights procedure (2.3) with the prior (2.5) achieves the optimal rate of sparse
prediction for any vector $\theta$ satisfying $|J(\theta)| \le \frac{\mathrm{rank}(A)}{\log(1 + ep)}$.
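
A quick numerical comparison of the two rates can be sketched in Python. The lower-bound rate is taken as $\frac{\sigma^2}{n}[\mathrm{rank}(A) \wedge s\log(1 + ep/s)]$ and the upper-bound remainder as $\frac{\sigma^2 s}{n}(4\log\frac{pe}{\alpha s} + 1) + \frac{4\sigma^2}{n}\log\frac{1}{1-\alpha}$; both expressions are readings of partly illegible displays and should be treated as assumptions, as should the function names and the numerical values.

```python
import math


def lower_rate(sigma2: float, n: int, p: int, s: int, rank: int) -> float:
    """(sigma^2 / n) * min(rank(A), s * log(1 + e p / s)), without the constant c*."""
    return sigma2 / n * min(rank, s * math.log(1.0 + math.e * p / s))


def upper_remainder(sigma2: float, n: int, p: int, s: int, alpha: float) -> float:
    """Remainder term of the oracle inequality for a vector with s nonzero
    coordinates, under the assumed form of (2.6)."""
    sparsity_term = sigma2 * s / n * (4.0 * math.log(p * math.e / (alpha * s)) + 1.0)
    prior_term = 4.0 * sigma2 / n * math.log(1.0 / (1.0 - alpha))
    return sparsity_term + prior_term


# Hypothetical setting: n = 200 observations, p = 1000 functions, s = 5.
ratio = upper_remainder(1.0, 200, 1000, 5, 0.5) / lower_rate(1.0, 200, 1000, 5, 200)
```

In this hypothetical setting the two quantities differ only by a moderate constant factor, which is what matching rates (up to constants) means in practice.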



3. Sparsity oracle inequality in probability 

In Section 2 we assumed the design is deterministic and we established an
oracle inequality in expectation with the optimal rate of sparse prediction. We
now want to establish an oracle inequality in probability that holds true for
deterministic and random design.

From now on, the design can be either deterministic or random.
We make the following mild assumption:

$$L = \max_{1 \le j \le p} \|\phi_j\|_\infty < \infty.$$
We assume in this section that the noise variables are subgaussian. More
precisely we have the following condition.

Assumption 3.1. The noise variables $W_1, \dots, W_n$ are independent and
independent of $X_1, \dots, X_n$. We assume also that there exist two known constants
$\sigma > 0$ and $\xi > 0$ such that

$$\mathbb{E}(W_i^2) \le \sigma^2,$$

$$\forall k \ge 3, \quad \mathbb{E}(|W_i|^k) \le \sigma^2\, k!\, \xi^{k-2}.$$
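
As an illustration of this moment condition, the Python sketch below checks numerically that centered Gaussian noise with variance $\sigma^2$ satisfies $\mathbb{E}|W|^k \le \sigma^2 k!\, \xi^{k-2}$ with the choice $\xi = \sigma$, for a finite range of k. It uses the standard closed form $\mathbb{E}|W|^k = \sigma^k 2^{k/2} \Gamma((k+1)/2)/\sqrt{\pi}$; the function names and the range of k are hypothetical choices.

```python
import math


def gaussian_abs_moment(k: int, sigma: float) -> float:
    """E|W|^k for W ~ N(0, sigma^2)."""
    return sigma ** k * 2.0 ** (k / 2.0) * math.gamma((k + 1) / 2.0) / math.sqrt(math.pi)


def satisfies_moment_condition(sigma: float, xi: float, k_max: int = 30) -> bool:
    """Check E|W|^k <= sigma^2 * k! * xi^(k-2) for k = 3, ..., k_max."""
    return all(
        gaussian_abs_moment(k, sigma) <= sigma ** 2 * math.factorial(k) * xi ** (k - 2)
        for k in range(3, k_max + 1)
    )
```

The check is only numerical and only covers the tested range of k; too small a value of $\xi$ makes the condition fail already at $k = 3$.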

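The estimation method is a version of the Gibbs estimator introduced in
[13, 14]. Fix $K \ge 1$. First we define the prior probability distribution as follows.
For any $J \subset \{1, \dots, p\}$ let $u_J$ denote the uniform measure on $\Theta_{K+1}(J)$. We
define

$$m(d\theta) = \sum_{J \subset \{1, \dots, p\}} \pi_J\, u_J(d\theta)$$

with $\pi$ taken as in (2.5).

We are now ready to define our estimator. For any $\lambda > 0$ we consider the
probability measure $\tilde\rho_\lambda$ admitting the following density w.r.t. the probability
measure $m$

$$\frac{d\tilde\rho_\lambda}{dm}(\theta) = \frac{e^{-\lambda r(\theta)}}{\int_{\Theta_{K+1}} e^{-\lambda r}\, dm}. \qquad (3.1)$$

The aggregate $\tilde f_n$ is defined as follows

$$\tilde f_n = f_{\tilde\theta_n}, \qquad \tilde\theta_n = \tilde\theta_n(\lambda, m) = \int_{\Theta_{K+1}} \theta\, \tilde\rho_\lambda(d\theta). \qquad (3.2)$$

Define

$$C_1 = \left[8\sigma^2 + (2\|f\|_\infty + L(2K+1))^2\right] \vee \left[8\left[\xi + (2\|f\|_\infty + L(2K+1))\right] L(2K+1)\right].$$

We can now state the main result of this section.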
We can now state the main result of this section. 

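Theorem 3.1. Let Assumption 3.1 be satisfied. Take $K \ge 1$ and $\lambda = \lambda^* = \frac{n}{2C_1}$.
Assume that $\arg\min_{\theta \in \mathbb{R}^p} R(\theta) \cap \Theta_K \neq \emptyset$. Then we have, for any $\varepsilon \in (0, 1)$ and
any $\bar\theta \in \arg\min_{\theta \in \mathbb{R}^p} R(\theta) \cap \Theta_K$, with probability at least $1 - \varepsilon$,

$$R(\tilde\theta_n) \le R(\bar\theta) + \frac{3L^2}{n^2} + \frac{8C_1}{n} \left[ |J(\bar\theta)| \log(K+1) + |J(\bar\theta)| \log\left(\frac{enp}{\alpha |J(\bar\theta)|}\right) + \log\left(\frac{2}{\varepsilon(1-\alpha)}\right) \right].$$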

The choice $\lambda = \lambda^*$ comes from the optimization of a (rather pessimistic)
upper bound on the risk R (see Inequality (4.9) in the proof of this theorem).
However, this choice is not necessarily the best choice in practice even
though it gives the right order of magnitude for $\lambda$. The practitioner may use
cross-validation to properly tune the temperature parameter.
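
To get a feeling for the order of magnitude of the theoretical temperature, the following Python sketch evaluates the constant $C_1$ as defined before Theorem 3.1 and the value $n/(2C_1)$ used for $\lambda$ in the proof. The function names and the numerical inputs are hypothetical; in practice $\|f\|_\infty$ is unknown and would have to be replaced by an upper bound.

```python
def c1_constant(sigma: float, xi: float, f_sup: float, L: float, K: float) -> float:
    """C_1 = [8 sigma^2 + b^2] v [8 (xi + b) L (2K + 1)], b = 2 ||f||_inf + L (2K + 1)."""
    b = 2.0 * f_sup + L * (2.0 * K + 1.0)
    return max(8.0 * sigma ** 2 + b ** 2, 8.0 * (xi + b) * L * (2.0 * K + 1.0))


def lambda_star(n: int, sigma: float, xi: float, f_sup: float, L: float,
                K: float) -> float:
    """Theoretical temperature n / (2 C_1)."""
    return n / (2.0 * c1_constant(sigma, xi, f_sup, L, K))
```

Since $C_1$ grows quickly with $L$, $K$ and the bound on $f$, the resulting temperature is typically conservative, which is consistent with the remark that cross-validation may be preferable in practice.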
Theorem 3.1 improves upon previous results on the following points:

1) Our oracle inequality is sharp in the sense that the leading factor in front
of $R(\bar\theta)$ is equal to 1 and we require only $\max_j \|\phi_j\|_\infty < \infty$, whereas $\ell_1$-penalized
empirical risk minimization procedures such as the Lasso or the Dantzig selector have a
leading factor strictly larger than 1 and impose in addition stringent conditions
on the dictionary (cf. [7, 11]). For instance, [7] imposes the design matrix $A =
(\phi_j(X_i))_{1 \le i \le n, 1 \le j \le p}$ to be deterministic and to satisfy the following restricted
eigenvalue condition for the Lasso

$$\kappa(s) = \min_{\substack{J_0 \subset \{1, \dots, p\} \\ |J_0| \le s}}\ \min_{\substack{\Delta \in \mathbb{R}^p \setminus \{0\} : \\ |\Delta_{J_0^c}|_1 \le 3 |\Delta_{J_0}|_1}} \frac{|A\Delta|_2}{\sqrt{n}\, |\Delta_{J_0}|_2} > 0,$$

where for any $\Delta \in \mathbb{R}^p$ and $J \subset \{1, \dots, p\}$, we denote by $\Delta_J$ the vector in $\mathbb{R}^p$
which has the same components as $\Delta$ on J and zero coordinates on the
complement $J^c$. Assuming in addition that the noise is Gaussian $N(0, \sigma^2)$ and taking

A = Aa^j ^p, A > 8, [7] proved that the Lasso 6 L satisfies with probability at 

least 1 - M l - A2 ' & 

-llfo - f\\l < (1 + v)-\\fo - /II' + C( V )A 2 o*\ J(0)\—, V0 G 
n n n 

where $\eta, C(\eta) > 0$ and $C(\eta)$ increases to $+\infty$ as $\eta$ tends to 0.

On the downside, our estimator requires the additional condition $|\bar\theta|_1 \le K$.
This condition is common in the PAC-Bayesian literature. Removing this
condition is a difficult problem and does not seem possible with the current techniques
of proof, where this condition is needed in order to apply Bernstein's inequality.

2) We establish a sparsity oracle inequality in probability for the integrated
risk $R(\cdot)$ whereas previous results on the exponential weights are given in
expectation [16, 17, 19, 24, 29, 36].

3) Unlike mirror averaging or progressive mixture rules, which satisfy similar
inequalities in expectation, our estimator does not involve an averaging step
[18, 24, 29]. As a consequence, its computational complexity is significantly
reduced as compared to those procedures with an averaging step. For instance
[29] considered the model (1.1) with random design and i.i.d. observations
$(X_1, Y_1), \dots, (X_n, Y_n)$, $n \ge 2$. The studied estimator is the following mirror
averaging scheme

n — 1 

faMA, 6 = — y 6k, 

71 ^— ' 

fe=0 

where 9q = f e r K \ OdH and for any 1 < k < n — 1, 6k is defined similarly as 

6 n in (3.1)-(3.2) with r(6) replaced by r k {6) = ±£- = i(^ - fe(X t )) 2 . These 
estimators can be implemented, for example, by MCMC. In this case, computing
the integral $\int_{\Theta_{K+1}} \theta\, \tilde\rho_\lambda(d\theta)$ is the most time-consuming part of the procedure.



The procedure (3.1)-(3.2) requires computing this integral only once whereas
the mirror averaging procedure of [29] requires computing integrals of this form 
n times. 

4) Under the assumption < K for some absolute constant K and taking
$\varepsilon = n^{-1}$, we have with probability at least $1 - n^{-1}$

R{h) < R(0) + C^\J(9)\ log (-^L) + (3.3) 
n \a\J(u)\J n A 

for some absolute constant $C > 0$. In [36] a minimax lower bound in expectation
is established for deterministic design and Gaussian noise. A similar result holds
in probability with the same proof combined with Theorem 2.7 of [41]. There
exist absolute constants $c_1, c_2 > 0$ such that

$$\inf_{T_n}\ \sup_{\theta \in \mathbb{R}^p \setminus \{0\} :\, |J(\theta)| \le s}\ \sup_f\ \mathbb{P}\left\{ \|T_n - f\|_n^2 \ge \|f_\theta - f\|_n^2 + c_1 \frac{\sigma^2}{n} \left[ \mathrm{rank}(A) \wedge s \log\left(1 + \frac{ep}{s}\right) \right] \right\} \ge c_2.$$

If $s \le \frac{\mathrm{rank}(A)}{\log(1 + ep)}$ then we observe that the upper bound in (3.3) is optimal up to
the additional logarithmic factor $\log n$. Note however that if $p \ge n^{1+\delta}$ for some
$\delta > 0$, which is relevant in the high-dimensional setting we consider in this
paper, then our bound is rate optimal.



4. Proofs 

4.1. Proofs of Section 2

This proof uses an argument from Leung and Barron [28].

Proof of Proposition 2.1. The mapping $Y \mapsto \hat f_n(Y) = (\hat f_n(X_1, Y), \dots, \hat f_n(X_n, Y))^T$
is clearly continuously differentiable by composition of elementary
differentiable functions. For any subset $J \subset \{1, \dots, p\}$ define $A_J = (\phi_j(X_i))_{1 \le i \le n,\, j \in J}$,
$\Sigma_J = \frac{1}{n} A_J^T A_J$, $\Phi_J(\cdot) = (\phi_j(\cdot))_{j \in J}$ and

$$g_J = e^{-\lambda\left(\|Y - f_J\|_n^2 + \frac{2\sigma^2 |J|}{n}\right)},$$

where

$$f_J(x, Y) = \frac{1}{n} Y^T A_J \Sigma_J^+ \Phi_J(x)^T,$$

and $\Sigma_J^+$ denotes the pseudo-inverse of $\Sigma_J$. Denote by $\partial_i$ the derivative w.r.t. $Y_i$.
Simple computations give

$$\partial_i f_J(x, Y) = \frac{1}{n} \Phi_J(X_i) \Sigma_J^+ \Phi_J(x)^T,$$

$$\left(\partial_i f_J(X_1, Y), \dots, \partial_i f_J(X_n, Y)\right) Y = f_J(X_i, Y),$$

and

$$\sum_{l=1}^n f_J(X_l, Y)\, \partial_i f_J(X_l, Y) = f_J(X_i, Y).$$

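Thus we have

$$\partial_i(g_J) = -\lambda\, \partial_i\left(\|Y - f_J\|_n^2\right) g_J = -\frac{2\lambda}{n} \left[ (Y_i - f_J(X_i, Y)) - \sum_{l=1}^n \partial_i f_J(X_l, Y)\,(Y_l - f_J(X_l, Y)) \right] g_J = -\frac{2\lambda}{n} (Y_i - f_J(X_i, Y))\, g_J.$$

Recall that

$$\hat f_n(\cdot, Y) = \frac{\sum_{J \in \mathcal{P}_n(\{1, \dots, p\})} \pi_J g_J f_J(\cdot, Y)}{\sum_{J \in \mathcal{P}_n(\{1, \dots, p\})} \pi_J g_J}.$$

We have, all sums being taken over $J \in \mathcal{P}_n(\{1, \dots, p\})$,

$$\partial_i \hat f_n(X_i, Y) = \frac{\sum_J \pi_J \left( \partial_i(g_J) f_J(X_i, Y) + g_J\, \partial_i f_J(X_i, Y) \right)}{\sum_J \pi_J g_J} - \frac{\left(\sum_J \pi_J g_J f_J(X_i, Y)\right)\left(\sum_J \pi_J\, \partial_i(g_J)\right)}{\left(\sum_J \pi_J g_J\right)^2}$$

$$= -\frac{2\lambda}{n} Y_i \hat f_n(X_i, Y) + \frac{2\lambda}{n} \frac{\sum_J f_J(X_i, Y)^2 \pi_J g_J}{\sum_J \pi_J g_J} + \frac{1}{n} \frac{\sum_J \Phi_J(X_i) \Sigma_J^+ \Phi_J(X_i)^T \pi_J g_J}{\sum_J \pi_J g_J} + \frac{2\lambda}{n} Y_i \hat f_n(X_i, Y) - \frac{2\lambda}{n} \hat f_n(X_i, Y)^2$$

$$= \frac{2\lambda}{n} \frac{\sum_J \left( f_J(X_i, Y) - \hat f_n(X_i, Y) \right)^2 \pi_J g_J}{\sum_J \pi_J g_J} + \frac{1}{n} \frac{\sum_J \Phi_J(X_i) \Sigma_J^+ \Phi_J(X_i)^T \pi_J g_J}{\sum_J \pi_J g_J} \ge 0. \qquad (4.1)$$

Consider the following estimator of the risk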



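$$\hat r_n(Y) = \|\hat f_n(Y) - Y\|_n^2 + \frac{2\sigma^2}{n} \sum_{i=1}^n \partial_i \hat f_n(X_i, Y) - \sigma^2. \qquad (4.2)$$

Using an argument based on Stein's identity as in [27] we now prove that
$\mathbb{E}[\hat r_n(Y)] = \mathbb{E}\left[\|\hat f_n(Y) - f\|_n^2\right]$.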
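We have

$$\mathbb{E}\left[\|\hat f_n(Y) - f\|_n^2\right] = \mathbb{E}\left[ \|\hat f_n(Y) - Y\|_n^2 + \frac{2}{n} \sum_{i=1}^n W_i \left( \hat f_n(X_i, Y) - f(X_i) \right) \right] - \sigma^2 = \mathbb{E}\left[ \|\hat f_n(Y) - Y\|_n^2 + \frac{2}{n} \sum_{i=1}^n W_i \hat f_n(X_i, Y) \right] - \sigma^2. \qquad (4.3)$$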



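For $z = (z_1, \dots, z_n)^T \in \mathbb{R}^n$ write $F_{W,i}(z) = \prod_{j \neq i} F_W(z_j)$, where $F_W$ denotes
the c.d.f. of the random variable $W_1$. Since $\mathbb{E}(W_i) = 0$ we have

$$\mathbb{E}\left[ W_i \hat f_n(X_i, Y) \right] = \mathbb{E}\left[ W_i \int_0^{W_i} \partial_i \hat f_n(X_i, Y_1, \dots, Y_{i-1}, f(X_i) + z, Y_{i+1}, \dots, Y_n)\, dz \right] = \int_{\mathbb{R}^{n-1}} \left( \int_{\mathbb{R}} y \int_0^y \partial_i \hat f_n(X_i, f + z)\, dz_i\, dF_W(y) \right) dF_{W,i}(z). \qquad (4.4)$$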



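In view of (4.1) we can apply Fubini's Theorem to the right-hand side of (4.4).
We obtain under the assumption $W \sim N(0, \sigma^2)$ that

$$\int_{\mathbb{R}_+} y \int_0^y \partial_i \hat f_n(X_i, f + z)\, dz_i\, dF_W(y) = \int_{\mathbb{R}_+} \left( \int_{z_i}^{\infty} y\, dF_W(y) \right) \partial_i \hat f_n(X_i, f + z)\, dz_i = \int_{\mathbb{R}_+} \sigma^2\, \partial_i \hat f_n(X_i, f + z)\, dF_W(z_i).$$

A similar equality holds for the integral over $\mathbb{R}_-$. Thus we obtain

$$\mathbb{E}\left[ W_i \hat f_n(X_i, Y) \right] = \sigma^2\, \mathbb{E}\left[ \partial_i \hat f_n(X_i, Y) \right].$$

Combining (4.2), (4.3) and the above display gives

$$\mathbb{E}[\hat r_n(Y)] = \mathbb{E}\left[\|\hat f_n(Y) - f\|_n^2\right].$$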



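Since $\hat f_n(\cdot, Y)$ is the expectation of $f_J(\cdot, Y)$ w.r.t. the probability distribution
$\propto g \cdot \pi$, we have

$$\|\hat f_n(\cdot, Y) - Y\|_n^2 = \frac{\sum_{J \in \mathcal{P}_n(\{1, \dots, p\})} \left( \|f_J(\cdot, Y) - Y\|_n^2 - \|f_J(\cdot, Y) - \hat f_n(\cdot, Y)\|_n^2 \right) g_J \pi_J}{\sum_{J \in \mathcal{P}_n(\{1, \dots, p\})} g_J \pi_J}.$$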



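For the sake of simplicity set $f_J = f_J(\cdot, Y)$ and $\hat f_n = \hat f_n(\cdot, Y)$. Combining (4.1), (4.2),
the above display and $\lambda \le \frac{n}{4\sigma^2}$ yields, all sums being taken over $J \in \mathcal{P}_n(\{1, \dots, p\})$,

$$\hat r_n(Y) = \frac{\sum_J \left[ \|f_J - Y\|_n^2 + \left( \frac{4\lambda\sigma^2}{n} - 1 \right) \|f_J - \hat f_n\|_n^2 \right] g_J \pi_J}{\sum_J g_J \pi_J} + \frac{2\sigma^2}{n^2} \sum_{i=1}^n \frac{\sum_J \Phi_J(X_i) \Sigma_J^+ \Phi_J(X_i)^T g_J \pi_J}{\sum_J g_J \pi_J} - \sigma^2$$

$$\le \frac{\sum_J \left[ \|f_J - Y\|_n^2 + \frac{2\sigma^2}{n} |J| \right] g_J \pi_J}{\sum_J g_J \pi_J} - \sigma^2.$$

By definition of $g_J$ we have

$$\|f_J - Y\|_n^2 + \frac{2\sigma^2}{n} |J| = -\frac{1}{\lambda} \log\left( \frac{g_J}{\sum_{J'} g_{J'} \pi_{J'}} \right) - \frac{1}{\lambda} \log\left( \sum_{J'} g_{J'} \pi_{J'} \right).$$

Integrating the above display w.r.t. the probability distribution $\frac{1}{C} g \cdot \pi$ (where
$C = \sum_{J \in \mathcal{P}_n(\{1, \dots, p\})} g_J \pi_J$ is the normalization factor) and using the fact that

$$\sum_{J \in \mathcal{P}_n(\{1, \dots, p\})} \frac{1}{C} g_J \pi_J \log\left( \frac{1}{C} g_J \right) = \mathcal{K}\left( \frac{1}{C} g \cdot \pi, \pi \right) \ge 0$$

as well as a convex duality argument (cf., e.g., [20], p. 264) we get

$$\hat r_n(Y) \le \sum_{J \in \mathcal{P}_n(\{1, \dots, p\})} \left( \|f_J - Y\|_n^2 + \frac{2\sigma^2}{n} |J| \right) \pi'_J + \frac{1}{\lambda} \mathcal{K}(\pi', \pi) - \sigma^2$$

for all probability measures $\pi'$ on $\mathcal{P}_n(\{1, \dots, p\})$. Taking the expectation in the
last inequality we get for any $\pi'$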

Wfn-fWl] =E[f„(F)] 

E[||/j - Y\\l] + ^| jf) ttS + ^(t/.tt) - a 2 




/i | 2 ] + -^E[^/ J (^,r)] + — |j| 



+ -K(tt',tt) 

Je-p n ({i,...,p}) V 7 

where we have used Stein's argument E[Wifj(Xi,Y)] = a 2 E [difj(X i: Y)} and 
the fact that Y17=i ®ifj(Xi,Y) — 1 in the last line. Finally taking tt' in the set 
of Dirac distributions on the subset J of {1, ... , p} yields the theorem. □ 
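
The convex duality argument used in this proof is the variational identity $-\frac{1}{\lambda}\log\sum_J \pi_J e^{-\lambda h_J} = \min_{\pi'}\{\sum_J \pi'_J h_J + \frac{1}{\lambda}\mathcal{K}(\pi', \pi)\}$, with the minimum attained at the Gibbs distribution $\propto \pi_J e^{-\lambda h_J}$. The Python sketch below checks this identity numerically on a small finite set; the function names and the numerical values are hypothetical.

```python
import math
from typing import List, Sequence


def kl(q: Sequence[float], pi: Sequence[float]) -> float:
    """Kullback-Leibler divergence K(q, pi), with the convention 0 log 0 = 0."""
    return sum(qi * math.log(qi / pii) for qi, pii in zip(q, pi) if qi > 0.0)


def free_energy(h: Sequence[float], pi: Sequence[float], lam: float) -> float:
    """-(1/lam) * log(sum_J pi_J * exp(-lam * h_J))."""
    return -math.log(sum(p * math.exp(-lam * x) for p, x in zip(pi, h))) / lam


def gibbs(h: Sequence[float], pi: Sequence[float], lam: float) -> List[float]:
    """Gibbs distribution proportional to pi_J * exp(-lam * h_J)."""
    unnormalized = [p * math.exp(-lam * x) for p, x in zip(pi, h)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def objective(q: Sequence[float], h: Sequence[float], pi: Sequence[float],
              lam: float) -> float:
    """sum_J q_J h_J + K(q, pi) / lam."""
    return sum(qi * hi for qi, hi in zip(q, h)) + kl(q, pi) / lam
```

Taking $\pi'$ equal to a Dirac mass at a single J in this identity gives the bound $h_J + \frac{1}{\lambda}\log\frac{1}{\pi_J}$, which is how the term $\frac{1}{\lambda}\log(1/\pi_J)$ arises in Proposition 2.1.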

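Proof of Theorem 2.1. First note that any minimizer $\theta \in \mathbb{R}^p$ of the right-hand
side in (2.6) is such that $|J(\theta)| \le \mathrm{rank}(A) \le n$ where we recall that $A =
(\phi_j(X_i))_{1 \le i \le n, 1 \le j \le p}$. Indeed, for any $\theta \in \mathbb{R}^p$ such that $|J(\theta)| > \mathrm{rank}(A)$ we can
construct a vector $\theta' \in \mathbb{R}^p$ such that $f_\theta = f_{\theta'}$ and $|J(\theta')| \le \mathrm{rank}(A)$, and the
mapping $x \mapsto x \log\left(\frac{ep}{\alpha x}\right)$ is nondecreasing on $(0, p]$.
Next for any $J \in \mathcal{P}_n(\{1, \dots, p\})$ we have

$$\mathbb{E}\left[\|f_{\hat\theta_J} - f\|_n^2\right] = \min_{\theta \in \Theta(J)} \left\{ \|f_\theta - f\|_n^2 \right\} + \frac{\sigma^2\, \mathrm{rank}(A_J)}{n} \le \min_{\theta \in \Theta(J)} \left\{ \|f_\theta - f\|_n^2 \right\} + \frac{\sigma^2 |J|}{n}.$$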
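Thus

$$\min_{J \in \mathcal{P}_n(\{1, \dots, p\})} \left\{ \mathbb{E}\left[\|f_{\hat\theta_J} - f\|_n^2\right] + \frac{1}{\lambda} \log\left(\frac{1}{\pi_J}\right) \right\} \le \min_{J \in \mathcal{P}_n(\{1, \dots, p\})}\ \min_{\theta \in \Theta(J)} \left\{ \|f_\theta - f\|_n^2 + \frac{1}{\lambda} \log\left(\frac{1}{\pi_J}\right) + \frac{\sigma^2 |J|}{n} \right\}$$

$$\le \min_{\theta \in \mathbb{R}^p :\, |J(\theta)| \le n} \left\{ \|f_\theta - f\|_n^2 + \frac{1}{\lambda} \log\left(\frac{1}{\pi_{J(\theta)}}\right) + \frac{\sigma^2 |J(\theta)|}{n} \right\}.$$

Combining the above display with Proposition 2.1 and our definition of the prior
$\pi$ gives the result. □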

4.2. Proof of Theorem 3.1 

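We state below a version of Bernstein's inequality useful in the proof of Theorem
3.1. See Proposition 2.9 page 24 in [33], more precisely Inequality (2.21).

Lemma 4.1. Let $T_1, \dots, T_n$ be independent real-valued random variables. Let
us assume that there are two constants v and w such that

$$\sum_{i=1}^n \mathbb{E}[T_i^2] \le v$$

and for all integers $k \ge 3$,

$$\sum_{i=1}^n \mathbb{E}\left[(T_i)_+^k\right] \le \frac{k!\, v\, w^{k-2}}{2}.$$

Then, for any $\zeta \in (0, 1/w)$,

$$\mathbb{E} \exp\left[ \zeta \sum_{i=1}^n \left( T_i - \mathbb{E}(T_i) \right) \right] \le \exp\left( \frac{v \zeta^2}{2(1 - w\zeta)} \right).$$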
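
As a sanity check of this type of Bernstein bound, consider the hypothetical example of independent Rademacher variables $T_i \in \{-1, +1\}$. Then $\sum_i \mathbb{E}[T_i^2] = n$ and $\sum_i \mathbb{E}[(T_i)_+^k] = n/2$, so the moment conditions hold with $v = n$ and $w = 1$, and the Laplace transform is known in closed form: $\mathbb{E}\exp(\zeta \sum_i T_i) = \cosh(\zeta)^n$. The Python sketch below compares both sides on the log scale; the function names are illustrative only.

```python
import math


def log_mgf_rademacher_sum(zeta: float, n: int) -> float:
    """log E exp(zeta * sum_i T_i) for n independent Rademacher variables."""
    return n * math.log(math.cosh(zeta))


def log_bernstein_bound(zeta: float, v: float, w: float) -> float:
    """Exponent v * zeta^2 / (2 * (1 - w * zeta)), valid for 0 < zeta < 1 / w."""
    return v * zeta ** 2 / (2.0 * (1.0 - w * zeta))


def bound_holds(n: int, zetas) -> bool:
    """Check the inequality for the Rademacher example with v = n, w = 1."""
    return all(
        log_mgf_rademacher_sum(z, n) <= log_bernstein_bound(z, float(n), 1.0)
        for z in zetas
    )
```

Working with logarithms avoids overflow when $\zeta$ approaches $1/w$, where the right-hand side blows up.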



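Proof of Theorem 3.1. For any $\theta \in \Theta_{K+1}$ define the random variables

$$T_i = T_i(\theta) = -(Y_i - f_\theta(X_i))^2 + (Y_i - f_{\bar\theta}(X_i))^2.$$

Note that these variables are independent. We have

$$\sum_{i=1}^n \mathbb{E}[T_i^2] = \sum_{i=1}^n \mathbb{E}\left\{ \left[2Y_i - f_{\bar\theta}(X_i) - f_\theta(X_i)\right]^2 \left[f_{\bar\theta}(X_i) - f_\theta(X_i)\right]^2 \right\}$$

$$= \sum_{i=1}^n \mathbb{E}\left\{ \left[2W_i + 2f(X_i) - f_{\bar\theta}(X_i) - f_\theta(X_i)\right]^2 \left[f_{\bar\theta}(X_i) - f_\theta(X_i)\right]^2 \right\}$$

$$\le \sum_{i=1}^n \mathbb{E}\left\{ \left[8W_i^2 + 2(2\|f\|_\infty + L(2K+1))^2\right] \left[f_{\bar\theta}(X_i) - f_\theta(X_i)\right]^2 \right\}$$

$$= \sum_{i=1}^n \mathbb{E}\left[8W_i^2 + 2(2\|f\|_\infty + L(2K+1))^2\right] \mathbb{E}\left\{ \left[f_{\bar\theta}(X_i) - f_\theta(X_i)\right]^2 \right\}$$

$$\le n \left[8\sigma^2 + 2(2\|f\|_\infty + L(2K+1))^2\right] \left[R(\theta) - R(\bar\theta)\right] =: v(\theta, \bar\theta) = v,$$

where we have used in the last line Pythagoras' theorem to prove $\|f_\theta - f_{\bar\theta}\|^2 =
R(\theta) - R(\bar\theta)$. Next we have, for any integer $k \ge 3$, that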

n n 

J2 E [Pi) k +] < E E [l 2y * " f§(Xi) ~ /*(*0I* \fs(Xi) - fe(X t )\ k 

i=l i=l 
n 

<E E I 22 "' 1 [l^l fe + (11/11- + L ( K + V2)) fe ] \f§{Xi) - MXjf 

z=l 

n 

< [2 2fe - 1 [|^| fc + (||/|| co + L(A-+l/2))' £ ] [L(2i^+l)] fc - 2 [/ ff (X i )-/fl(X i )] s 

< 2 2 *- 1 [a 2 k\i k - 2 + (H/lloo + L(if + l/2)) fc ] [L(2X + l)] fe ~ 2 

n 

x^E[[/ e -(X,)-/ e (X 4 )] 2 " 
»=i 

(a 2 fc!C fc - 2 + (ll/lloo + £(g + l/2)) k )(4L(2K + l)) fc - 2 
4(^ + (||/|| 00 +i(^ + l/2))2) 

< I (fci^-2 + [ll/IU + L(K + l/2)] k - 2 ) [4L(2K+ l)f~ 2 v 

h.\ ln k—2 

< ~ A k\ (£ + [ll/lloo + L(K + l/2)]f- 2 [4L(2K + 1)}^ < V^—, 

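with $w := 8\left(\xi + \left[\|f\|_\infty + L(K + 1/2)\right]\right) L(K + 1/2)$.

Next, for any $\lambda \in (0, n/w)$ and $\theta \in \Theta_{K+1}$, applying Lemma 4.1 with $\zeta = \lambda/n$
gives

$$\mathbb{E} \exp\left[ \lambda\left( R(\theta) - R(\bar\theta) - r(\theta) + r(\bar\theta) \right) \right] \le \exp\left[ \frac{v \lambda^2}{2n^2\left(1 - \frac{w\lambda}{n}\right)} \right].$$

Set $C = 8\left(\sigma^2 + \left[\|f\|_\infty + L(K + 1/2)\right]^2\right)$. For the sake of simplicity let us put

$$\beta = \lambda - \frac{\lambda^2 C}{2n\left(1 - \frac{w\lambda}{n}\right)}. \qquad (4.5)$$

For any $\varepsilon > 0$ the last display yields

$$\mathbb{E} \exp\left[ \beta\left( R(\theta) - R(\bar\theta) \right) + \lambda\left( -r(\theta) + r(\bar\theta) \right) - \log\frac{2}{\varepsilon} \right] \le \frac{\varepsilon}{2}.$$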
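Integrating w.r.t. the probability distribution $m(\cdot)$ we get

$$\int \mathbb{E} \exp\left[ \beta\left( R(\theta) - R(\bar\theta) \right) + \lambda\left( -r(\theta) + r(\bar\theta) \right) - \log\frac{2}{\varepsilon} \right] m(d\theta) \le \frac{\varepsilon}{2}.$$

Next, Fubini's theorem gives

$$\mathbb{E} \int \exp\left[ \beta\left( R(\theta) - R(\bar\theta) \right) + \lambda\left( -r(\theta) + r(\bar\theta) \right) - \log\frac{2}{\varepsilon} \right] m(d\theta)$$

$$= \mathbb{E} \int \exp\left[ \beta\left( R(\theta) - R(\bar\theta) \right) + \lambda\left( -r(\theta) + r(\bar\theta) \right) - \log\frac{d\tilde\rho_\lambda}{dm}(\theta) - \log\frac{2}{\varepsilon} \right] \tilde\rho_\lambda(d\theta) \le \frac{\varepsilon}{2}.$$

Jensen's inequality yields

$$\mathbb{E} \exp\left[ \beta\left( \int R\, d\tilde\rho_\lambda - R(\bar\theta) \right) + \lambda\left( -\int r\, d\tilde\rho_\lambda + r(\bar\theta) \right) - \mathcal{K}(\tilde\rho_\lambda, m) - \log\frac{2}{\varepsilon} \right] \le \frac{\varepsilon}{2}.$$

Now, using the basic inequality $\exp(x) \ge \mathbf{1}_{\mathbb{R}_+}(x)$ we get

$$\mathbb{P}\left\{ \beta\left( \int R\, d\tilde\rho_\lambda - R(\bar\theta) \right) + \lambda\left( -\int r\, d\tilde\rho_\lambda + r(\bar\theta) \right) - \mathcal{K}(\tilde\rho_\lambda, m) - \log\frac{2}{\varepsilon} \ge 0 \right\} \le \frac{\varepsilon}{2}.$$

Using Jensen's inequality again gives

$$\int R\, d\tilde\rho_\lambda \ge R\left( \int \theta\, \tilde\rho_\lambda(d\theta) \right) = R(\tilde\theta_\lambda).$$

Combining the last two displays we obtain

$$\mathbb{P}\left\{ R(\tilde\theta_\lambda) - R(\bar\theta) \le \frac{\lambda}{\beta} \left( \int r\, d\tilde\rho_\lambda - r(\bar\theta) + \frac{1}{\lambda}\left[ \mathcal{K}(\tilde\rho_\lambda, m) + \log\frac{2}{\varepsilon} \right] \right) \right\} \ge 1 - \frac{\varepsilon}{2}.$$

Now, using Lemma 1.1.3 in Catoni [14] we obtain that

$$\mathbb{P}\left\{ R(\tilde\theta_\lambda) - R(\bar\theta) \le \inf_{\rho \in \mathcal{M}_+^1(\Theta_{K+1})} \frac{\lambda}{\beta} \left( \int r\, d\rho - r(\bar\theta) + \frac{1}{\lambda}\left[ \mathcal{K}(\rho, m) + \log\frac{2}{\varepsilon} \right] \right) \right\} \ge 1 - \frac{\varepsilon}{2}. \qquad (4.6)$$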

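We now want to bound from above $r(\theta) - r(\bar\theta)$ by $R(\theta) - R(\bar\theta)$. Applying
Lemma 4.1 to $\tilde T_i(\theta) = -T_i(\theta)$ and similar computations as above yield
successively

$$\mathbb{E} \exp\left[ \lambda\left( R(\bar\theta) - R(\theta) + r(\theta) - r(\bar\theta) \right) \right] \le \exp\left[ \frac{v \lambda^2}{2n^2\left(1 - \frac{w\lambda}{n}\right)} \right],$$

and so for any (data-dependent) $\rho$,

$$\mathbb{E} \exp\left[ \gamma\left( -\int R\, d\rho + R(\bar\theta) \right) + \lambda\left( \int r\, d\rho - r(\bar\theta) \right) - \mathcal{K}(\rho, m) - \log\frac{2}{\varepsilon} \right] \le \frac{\varepsilon}{2},$$

where

$$\gamma = \lambda + \frac{\lambda^2 C}{2n\left(1 - \frac{w\lambda}{n}\right)}. \qquad (4.7)$$

Now,

$$\mathbb{P}\left\{ \int r\, d\rho - r(\bar\theta) \le \frac{\gamma}{\lambda}\left[ \int R\, d\rho - R(\bar\theta) \right] + \frac{1}{\lambda}\left[ \mathcal{K}(\rho, m) + \log\frac{2}{\varepsilon} \right] \right\} \ge 1 - \frac{\varepsilon}{2}. \qquad (4.8)$$

Combining (4.8) and (4.6) with a union bound argument gives

$$\mathbb{P}\left\{ R(\tilde\theta_\lambda) - R(\bar\theta) \le \inf_{\rho \in \mathcal{M}_+^1(\Theta_{K+1})} \frac{\gamma\left[ \int R\, d\rho - R(\bar\theta) \right] + 2\left[ \mathcal{K}(\rho, m) + \log\frac{2}{\varepsilon} \right]}{\beta} \right\} \ge 1 - \varepsilon,$$

where $\mathcal{M}_+^1(\Theta_{K+1})$ is the set of all probability measures over $\Theta_{K+1}$.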

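Now for any $\delta \in (0, 1]$ taking $\rho$ as the uniform probability measure on the set
$\{t \in \Theta(J(\bar\theta)) : |t - \bar\theta|_1 \le \delta\} \subset \Theta_{K+1}(J(\bar\theta))$ gives

$$\mathbb{P}\left\{ R(\tilde\theta_\lambda) \le R(\bar\theta) + \frac{1}{\beta}\left( \gamma L^2 \delta^2 + 2\left[ |J(\bar\theta)| \log\frac{K+1}{\delta} + |J(\bar\theta)| \log\frac{1}{\alpha} + \log\left(\frac{1}{1-\alpha}\right) + \log\binom{p}{|J(\bar\theta)|} + \log\frac{2}{\varepsilon} \right] \right) \right\} \ge 1 - \varepsilon.$$

Taking $\delta = n^{-1}$ and the inequality $\log\binom{p}{|J|} \le |J| \log\left(\frac{pe}{|J|}\right)$ gives

$$\mathbb{P}\left\{ R(\tilde\theta_\lambda) \le R(\bar\theta) + \frac{1}{1 - \frac{\lambda C}{2(n - w\lambda)}} \left[ \left( 1 + \frac{\lambda C}{2(n - w\lambda)} \right) \frac{L^2}{n^2} + \frac{2}{\lambda}\left( |J(\bar\theta)| \log(K+1) + |J(\bar\theta)| \log\left(\frac{enp}{\alpha |J(\bar\theta)|}\right) + \log\frac{2}{\varepsilon(1-\alpha)} \right) \right] \right\} \ge 1 - \varepsilon, \qquad (4.9)$$

where we replaced $\gamma$ and $\beta$ by their definitions, see (4.5) and (4.7). Taking now
$\lambda = n/(2C_1)$ (where we recall that $C_1 = C \vee w$) in (4.9) gives

$$\mathbb{P}\left\{ R(\tilde\theta_\lambda) \le R(\bar\theta) + \frac{3L^2}{n^2} + \frac{8C_1}{n}\left( |J(\bar\theta)| \log(K+1) + |J(\bar\theta)| \log\left(\frac{enp}{\alpha |J(\bar\theta)|}\right) + \log\frac{2}{\varepsilon(1-\alpha)} \right) \right\} \ge 1 - \varepsilon,$$

where we have used that $1 - \frac{\lambda C}{2(n - w\lambda)} \ge 1/2$ and $1 + \frac{\lambda C}{2(n - w\lambda)} \le 3/2$. □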
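
The last step of the proof can be checked numerically. The Python sketch below evaluates a bound of the form (4.9), namely $\frac{1}{1-x}\big[(1+x)\frac{L^2}{n^2} + \frac{2}{\lambda}T\big]$ with $x = \frac{\lambda C}{2(n - w\lambda)}$, and the simplified bound $\frac{3L^2}{n^2} + \frac{8C_1}{n}T$ obtained for $\lambda = n/(2C_1)$, where $T$ collects the logarithmic terms. Both expressions are reconstructions under the assumption $C \le C_1$ and $w \le C_1$; the function names and numerical values are hypothetical.

```python
import math


def log_terms(n: int, p: int, s: int, K: float, alpha: float, eps: float) -> float:
    """T = s log(K + 1) + s log(e n p / (alpha s)) + log(2 / (eps (1 - alpha))), s >= 1."""
    return (s * math.log(K + 1.0)
            + s * math.log(math.e * n * p / (alpha * s))
            + math.log(2.0 / (eps * (1.0 - alpha))))


def bound_general(n, p, s, K, alpha, eps, L, C, w, lam) -> float:
    """Remainder term of the (4.9)-type bound for a temperature lam < n / w."""
    x = lam * C / (2.0 * (n - w * lam))
    t = log_terms(n, p, s, K, alpha, eps)
    return ((1.0 + x) * L ** 2 / n ** 2 + 2.0 / lam * t) / (1.0 - x)


def bound_simplified(n, p, s, K, alpha, eps, L, C1) -> float:
    """Remainder term 3 L^2 / n^2 + (8 C1 / n) * T of the final inequality."""
    return 3.0 * L ** 2 / n ** 2 + 8.0 * C1 / n * log_terms(n, p, s, K, alpha, eps)
```

With $\lambda = n/(2C_1)$ one has $n - w\lambda \ge n/2$ and hence $x \le 1/2$, so the general bound never exceeds the simplified one; the two coincide in the extreme case $C = w = C_1$.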